Abstract

Solomonoff's inductive learning model is a powerful, universal and 
highly elegant theory of sequence prediction. Its critical flaw is that it 
is incomputable and thus cannot be used in practice. It is sometimes 
suggested that it may still be useful to help guide the development of 
very general and powerful theories of prediction which are computable. 
In this paper it is shown that although powerful algorithms exist, they
are necessarily highly complex. This alone makes their theoretical analysis
problematic; however, it is further shown that beyond a moderate level
of complexity the analysis runs into the deeper problem of Gödel incompleteness.
This limits the power of mathematics to analyse and study
prediction algorithms, and indeed intelligent systems in general.

1 Introduction 

Could there exist an elegant and universal theory of sequence prediction? 
Solomonoff's model of induction rapidly learns to make optimal predictions 
for any computable sequence, including probabilistic ones [131 [Tl] . Indeed the 
problem of sequence prediction could well be considered solved [HI El, if it were 
not for the fact that Solomonoff's theoretical model is incomputable. 

Among computable theories there exist powerful general predictors, such as 
the Lempel-Ziv algorithm |S] and Context Tree Weighting that can learn to 
predict some complex sequences, but not others. Some prediction methods, such 
as the Minimum Description Length principle [12] and the Minimum Message
Length principle [17], can even be viewed as computable approximations to
Solomonoff induction [10]. However, in practice their power and generality are
limited by the power of the compression and coding methods employed, and they
have significantly reduced data efficiency compared to Solomonoff
induction.

Could there exist elegant computable prediction algorithms that are in some
sense universal, or at least universal over large sets of simple sequences? In this 
paper we explore this fundamental question from the perspective of Kolmogorov 
complexity theory and uncover some surprising implications. 

2 Preliminaries 

An alphabet $\mathcal{A}$ is a finite set of 2 or more elements which are called symbols. In
this paper we will assume a binary alphabet $\mathbb{B} := \{0, 1\}$, though all the results
can easily be generalised to other alphabets. A string is a finite ordered $n$-tuple
of symbols denoted $x := x_1 x_2 \ldots x_n$ where $\forall i \in \{1, \ldots, n\}$, $x_i \in \mathbb{B}$, or more
succinctly, $x \in \mathbb{B}^n$. The 0-tuple is denoted $\lambda$ and is called the null string. The
expression $\mathbb{B}^{\le n}$ has the obvious interpretation, and $\mathbb{B}^* := \bigcup_{n \in \mathbb{N}} \mathbb{B}^n$. The
length-lexicographical ordering is a total order on $\mathbb{B}^*$ defined as
$\lambda < 0 < 1 < 00 < 01 < 10 < 11 < 000 < 001 < \cdots$. A substring of $x$ is defined
$x_{j:k} := x_j x_{j+1} \ldots x_k$ where $1 \le j \le k \le n$. By $|x|$ we mean the length of the
string $x$, for example, $|x_{j:k}| = k - j + 1$. We will sometimes need to encode a natural number as a
string. Using simple encoding techniques it can be shown that there exists a
computable injective function $f : \mathbb{N} \to \mathbb{B}^*$ where no string in the range of $f$ is a
prefix of any other, and $\forall n \in \mathbb{N} : |f(n)| \le \log_2 n + 2\log_2 \log_2 n + 1$.
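To make the idea of a prefix-free integer encoding concrete, the following sketch implements the Elias delta code, one standard self-delimiting code. It is an illustrative stand-in and not necessarily the function $f$ intended above: its codeword lengths agree with the stated bound only up to a small additive constant. The function names `elias_delta` and `is_prefix_free` are chosen here for illustration.

```python
def elias_delta(n: int) -> str:
    """Encode a positive integer as a self-delimiting (prefix-free) bit string."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    body = bin(n)[2:]            # binary expansion of n; its leading bit is always 1
    length = bin(len(body))[2:]  # binary expansion of the number of bits in n
    # Unary prefix announcing how many bits `length` has, then `length` itself,
    # then n without its implied leading 1.
    return "0" * (len(length) - 1) + length + body[1:]


def is_prefix_free(codewords) -> bool:
    """True if no codeword equals, or is a prefix of, another codeword."""
    words = sorted(codewords)
    # After sorting, any prefix relation shows up between adjacent words.
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))


codes = [elias_delta(n) for n in range(1, 513)]
print(elias_delta(10), is_prefix_free(codes))
```

The codeword for $n$ has $\lfloor \log_2 n \rfloor + 1 + 2\lfloor \log_2(\lfloor \log_2 n \rfloor + 1) \rfloor$ bits, which has the same leading behaviour as the bound on $|f(n)|$.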

Unlike strings which always have finite length, a sequence $\omega$ is an infinite
list of symbols $x_1 x_2 x_3 \ldots \in \mathbb{B}^\infty$. Of particular interest to us will be the class
of sequences which can be generated by an algorithm executed on a universal 
Turing machine: 

2.1 Definition. A monotone universal Turing machine U is defined as a 
universal Turing machine with one unidirectional input tape, one unidirectional 
output tape, and some bidirectional work tapes. Input tapes are read only, 
output tapes are write only, unidirectional tapes are those where the head can 
only move from left to right. All tapes are binary (no blank symbol) and the 
work tapes are initially filled with zeros. We say that $U$ outputs/computes a
sequence $\omega$ on input $p$, and write $U(p) = \omega$, if $U$ reads all of $p$ but no more as
it continues to write $\omega$ to the output tape.

We fix $U$ and define $U(p, x)$ by simply using a standard coding technique to
encode a program $p$ along with a string $x \in \mathbb{B}^*$ as a single input string for $U$.

2.2 Definition. A sequence $\omega \in \mathbb{B}^\infty$ is a computable binary sequence if
there exists a program $q \in \mathbb{B}^*$ that writes $\omega$ to a one-way output tape when run
on a monotone universal Turing machine $U$, that is, $\exists q \in \mathbb{B}^* : U(q) = \omega$. We
denote the set of all computable sequences by $C$.

A similar definition for strings is not necessary as all strings have finite length 
and are therefore trivially computable. 

2.3 Definition. A computable binary predictor is a program $p \in \mathbb{B}^*$ that
on a universal Turing machine $U$ computes a total function $\mathbb{B}^* \to \mathbb{B}$.

For simplicity of notation we will often write $p(x)$ to mean the function
computed by the program $p$ when executed on $U$ along with the input string $x$,
that is, $p(x)$ is shorthand for $U(p, x)$. Having $x_{1:n}$ as input, the objective of a
predictor is for its output, called its prediction, to match the next symbol in the
sequence. Formally we express this by writing $p(x_{1:n}) = x_{n+1}$.

As the algorithmic prediction of incomputable sequences, such as the halting 
sequence, is impossible by definition, we only consider the problem of predicting 
computable sequences. To simplify things we will assume that the predictor has 
an unlimited supply of computation time and storage. We will also make the 
assumption that the predictor has unlimited data to learn from, that is, we 
are only concerned with whether or not a predictor can learn to predict in the 
following sense: 

2.4 Definition. We say that a predictor $p$ can learn to predict a sequence
$\omega := x_1 x_2 \ldots \in \mathbb{B}^\infty$ if there exists $m \in \mathbb{N}$ such that $\forall n \ge m : p(x_{1:n}) = x_{n+1}$.

The existence of m in the above definition need not be constructive, that is, 
we might not know when the predictor will stop making prediction errors for 
a given sequence, just that this will occur eventually. This is essentially "next 
value" prediction as characterised by Barzdin ^J, which follows from Gold's 
notion of identifiability in the limit for languages [7]. 

2.5 Definition. Let $P(\omega)$ be the set of all predictors able to learn to predict
$\omega$. Similarly for sets of sequences $S \subset \mathbb{B}^\infty$, define $P(S) := \bigcap_{\omega \in S} P(\omega)$.

A standard measure of complexity for sequences is the length of the shortest 
program which generates the sequence: 

2.6 Definition. For any sequence $\omega \in \mathbb{B}^\infty$ the monotone Kolmogorov
complexity of the sequence is,

$$K(\omega) := \min_{q \in \mathbb{B}^*} \{ |q| : U(q) = \omega \},$$

where $U$ is a monotone universal Turing machine. If no such $q$ exists, we define
$K(\omega) := \infty$.

It can be shown that this measure of complexity depends on our choice 
of universal Turing machine U, but only up to an additive constant that is 
independent of $\omega$. This is due to the fact that a universal Turing machine can
simulate any other universal Turing machine with a fixed length program. 

In essentially the same way as the definition above we can define the Kolmogorov
complexity of a string $x \in \mathbb{B}^n$, written $K(x)$, by requiring that $U(q)$
halts after generating $x$ on the output tape. For an extensive treatment of
Kolmogorov complexity and some of its applications see TU' or ^ . 

As many of our results will have the above property of holding within an 
additive constant that is independent of the variables in the expression, we will 
indicate this by placing a small plus above the equality or inequality symbol. 
For example, $f(x) \overset{+}{<} g(x)$ means that $\exists c \in \mathbb{R}, \forall x : f(x) < g(x) + c$. When
using standard "Big O" notation this is unnecessary as expressions are already 
understood to hold within an independent constant, however for consistency of 
notation we will use it in these cases also. 

3 Prediction of computable sequences 

The most elementary result is that every computable sequence can be predicted
by at least one predictor, and that this predictor need not be significantly more 
complex than the sequence to be predicted. 

3.1 Lemma. $\forall \omega \in C, \exists p \in P(\omega) : K(p) \overset{+}{<} K(\omega)$.

Proof. As the sequence $\omega$ is computable, there must exist at least one
algorithm that generates $\omega$. Let $q$ be the shortest such algorithm and construct
an algorithm $p$ that "predicts" $\omega$ as follows: Firstly the algorithm $p$ reads $x_{1:n}$
to find the value of $n$, then it runs $q$ to generate $x_{1:n+1}$ and returns $x_{n+1}$ as
its prediction. Clearly $p$ perfectly predicts $\omega$ and $|p| < |q| + c$, for some small
constant $c$ that is independent of $\omega$ and $q$. □
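The construction in this proof can be sketched in a few lines. The sketch below assumes, purely for illustration, that a generating algorithm is modelled as a Python function mapping $n$ to the first $n$ symbols of the sequence; the names `predictor_from_generator` and `alternating` are hypothetical.

```python
from typing import Callable

SequenceGen = Callable[[int], str]   # n -> first n symbols of the sequence
Predictor = Callable[[str], str]     # observed prefix -> predicted next symbol


def predictor_from_generator(q: SequenceGen) -> Predictor:
    """Wrap a generator q of a sequence into a predictor of that sequence."""
    def p(prefix: str) -> str:
        n = len(prefix)          # read the input only to find n
        return q(n + 1)[n]       # run q for n+1 symbols and return the last one
    return p


def alternating(n: int) -> str:
    """Example generator: the first n symbols of 010101..."""
    return ("01" * n)[:n]


p = predictor_from_generator(alternating)
omega = alternating(50)
print(all(p(omega[:n]) == omega[n] for n in range(49)))
```

The wrapper adds only a fixed amount of code around `q`, which mirrors the constant $c$ in the proof.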

Not only can any computable sequence be predicted, there also exist very 
simple predictors able to predict arbitrarily complex sequences: 

3.2 Lemma. There exists a predictor $p$ such that $\forall n \in \mathbb{N}, \exists \omega \in C : p \in P(\omega)$
and $K(\omega) > n$.

Proof. Take a string $x$ such that $K(x) = |x| > 2n$, and from this define a
sequence $\omega := x0000\ldots$. Clearly $K(\omega) > n$ and yet a simple predictor $p$ that
always predicts 0 can learn to predict $\omega$. □
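As a small illustration of this proof, the sketch below applies a constant predictor to a sequence whose initial segment is arbitrary and whose tail is all zeros. A pseudo-random string is used only as a stand-in for an incompressible $x$ (it is of course not incompressible), and the helper names are hypothetical.

```python
import random


def always_zero(prefix: str) -> str:
    """The trivial predictor: ignore the input and predict 0."""
    return "0"


def errors_after(p, omega: str, m: int) -> int:
    """Number of wrong predictions of omega[n] for positions n >= m."""
    return sum(p(omega[:n]) != omega[n] for n in range(m, len(omega)))


rng = random.Random(0)
x = "".join(rng.choice("01") for _ in range(64))   # stand-in for a complex string
omega = x + "0" * 200                              # finite prefix of x000...
print(errors_after(always_zero, omega, len(x)))
```

Whatever the initial string is, no errors occur once the zero tail has begun, so the predictor learns the sequence in the sense of Definition 2.4.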

The predictor used in the above proof is very simple and can only learn 
sequences that end with all 0's, albeit where the initial string can have an
arbitrarily high Kolmogorov complexity. It is not hard to see that more sophisticated
predictors can learn to predict many other more subtle types of patterns
which are more complex than the predictor, such as arbitrary repeating strings, 
regular or primitive recursive sequences. 

As each computable sequence can be predicted, and simple predictors exist 
which can predict arbitrarily complex sequences, we might wonder whether there 
exists a computable predictor able to learn to predict all computable sequences. 
Unfortunately, no universal predictor exists, indeed for every predictor there 
exists a sequence which it cannot predict at all: 

3.3 Lemma. For any predictor $p$ there constructively exists a sequence $\omega :=
x_1 x_2 \ldots \in C$ such that $\forall n \in \mathbb{N} : p(x_{1:n}) \ne x_{n+1}$ and $K(\omega) \overset{+}{<} K(p)$.

Proof. For any computable predictor $p$ there constructively exists a computable
sequence $\omega = x_1 x_2 x_3 \ldots$ computed by an algorithm $q$ defined as follows:
Set $x_1 = 1 - p(\lambda)$, then $x_2 = 1 - p(x_1)$, then $x_3 = 1 - p(x_{1:2})$ and so on. Clearly
$\omega \in C$ and $\forall n \in \mathbb{N} : p(x_{1:n}) = 1 - x_{n+1}$.

Let $p^*$ be the shortest program that computes the same function as $p$ and
define a sequence generation algorithm $q^*$ based on $p^*$ using the procedure above.
By construction, $|q^*| = |p^*| + c$ for some constant $c$ that is independent of $p^*$.
Because $q^*$ generates $\omega$, it follows that $K(\omega) \le |q^*|$. By definition $K(p) = |p^*|$
and so $K(\omega) \overset{+}{<} K(p)$. □
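The diagonal construction used in this proof is short enough to sketch directly. The sketch assumes predictors are modelled as Python functions from an observed prefix to a predicted symbol; `adversarial_prefix` and the example predictor `majority` are illustrative names, not part of the original construction.

```python
def adversarial_prefix(p, length: int) -> str:
    """First `length` symbols of the sequence that p gets wrong at every step."""
    x = ""
    for _ in range(length):
        # Append the opposite of whatever p predicts from the symbols so far.
        x += "1" if p(x) == "0" else "0"
    return x


def majority(prefix: str) -> str:
    """Example predictor: predict the symbol seen most often so far (ties give 0)."""
    return "1" if prefix.count("1") > prefix.count("0") else "0"


omega = adversarial_prefix(majority, 40)
print(omega[:8], all(majority(omega[:n]) != omega[n] for n in range(40)))
```

The generator calls `p` once per symbol and inverts a single bit, which is why its description is only a constant longer than that of `p`.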

Allowing the predictor to be probabilistic does not fundamentally avoid the
problem of Lemma 3.3. In each step, rather than generating the opposite to what
will be predicted by $p$, instead $q$ attempts to generate the symbol which $p$ is least
likely to predict given $x_{1:n}$. To do this $q$ must simulate $p$ in order to estimate
$\Pr(p(x_{1:n}) = 1 \mid x_{1:n})$. With sufficient simulation effort, $q$ can estimate this
probability to any desired accuracy for any $x_{1:n}$. This produces a computable
sequence $\omega$ such that $\forall n \in \mathbb{N} : \Pr(p(x_{1:n}) = x_{n+1} \mid x_{1:n})$ is not significantly
greater than $\frac{1}{2}$, that is, the performance of $p$ is no better than a predictor that
makes completely random predictions.

The impossibility of prediction in this more general probabilistic setting has 
been pointed out before by Dawid [4]. Specifically, Dawid notes that for any statistical
forecasting system there exist sequences which are not calibrated. Dawid
also notes that a forecasting system for a family of distributions is necessarily 
more complex than any forecasting system generated from a single distribution 
in the family. However, he does not deal with the complexity of the sequences 
themselves, nor does he make a precise statement in terms of a specific measure 
of complexity, such as Kolmogorov complexity. The impossibility of forecasting
has since been developed in considerably more depth by V'yugin [16]; in
particular, it is proven that there is an efficient randomised procedure producing
sequences that cannot be predicted (with high probability) by computable
forecasting systems.

As probabilistic prediction complicates things without avoiding this fundamental
problem, in the remainder of this paper we will consider only deterministic
predictors. This will also allow us to see the roots of this problem as clearly
as possible. With the preliminaries covered, we now move on to the central
problem considered in this paper: predicting sequences of limited Kolmogorov
complexity.

4 Prediction of simple computable sequences 

As no computable predictor can learn to predict every computable sequence, a
weaker goal is to be able to predict all "simple" computable sequences.

4.1 Definition. For $n \in \mathbb{N}$, let $C_n := \{\omega \in C : K(\omega) \le n\}$. Further, let
$P_n := P(C_n)$ be the set of predictors able to learn to predict all sequences in $C_n$.

Firstly we establish that prediction algorithms exist that can learn to predict 
all sequences up to a given complexity, and that these predictors need not be 
significantly more complex than the sequences they can predict:

4.2 Lemma. $\forall n \in \mathbb{N}, \exists p \in P_n : K(p) \overset{+}{<} n + O(\log_2 n)$.

Proof. Let $h \in \mathbb{N}$ be the number of programs of length $n$ or less which
generate infinite sequences. Build the value of $h$ into a prediction algorithm $p$
constructed as follows:

In the $k$th prediction cycle run in parallel all programs of length $n$ or less
until $h$ of these programs have each produced $k + 1$ symbols of output. Next
predict according to the $(k+1)$th symbol of the generated string whose first $k$
symbols are consistent with the observed string. If two generated strings are
consistent with the observed sequence (there cannot be more than two as the
strings are binary and have length $k + 1$), pick the one which was generated by
the program that occurs first in a lexicographical ordering of the programs. If
no generated output is consistent, give up and output a fixed symbol.

For sufficiently large $k$, only the $h$ programs which produce infinite sequences
will produce output strings of length $k$. As this set of sequences is finite, they
can be uniquely identified by finite initial strings. Thus for sufficiently large
$k$ the predictor $p$ will correctly predict any computable sequence $\omega$ for which
$K(\omega) \le n$, that is, $p \in P_n$.

As there are $2^{n+1} - 1$ possible strings of length $n$ or less, $h < 2^{n+1}$ and thus
we can encode $h$ with $\log_2 h + 2\log_2 \log_2 h \le n + 1 + 2\log_2(n + 1)$ bits. Thus,
$K(p) < n + 1 + 2\log_2(n + 1) + c$ for some constant $c$ that is independent of $n$.
□
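The prediction step of this proof is a form of identification by enumeration over a finite set of hypotheses. The toy sketch below illustrates only that step, under strong simplifying assumptions: "programs" are modelled as Python functions that return their first $k$ output symbols, or `None` if they never produce that many, so the role played by $h$ (knowing when to stop waiting for slow or non-productive programs) is replaced here by the `None` return value. All names are hypothetical.

```python
def enumeration_predictor(programs):
    """Predict with the first program whose output agrees with the observed prefix."""
    def p(prefix: str) -> str:
        k = len(prefix)
        for prog in programs:        # list order stands in for lexicographic order
            out = prog(k + 1)        # first k+1 symbols, or None if output is shorter
            if out is not None and out[:k] == prefix:
                return out[k]
        return "0"                   # nothing consistent: give up, output a fixed symbol
    return p


# Hypothetical toy "programs": each maps k to its first k output symbols,
# or to None when it never produces that many.
def all_zeros(k): return "0" * k
def finite_output(k): return "0101"[:k] if k <= 4 else None
def alternating(k): return ("01" * k)[:k]
def zeros_then_ones(k): return ("000" + "1" * k)[:k]

PROGRAMS = [all_zeros, finite_output, alternating, zeros_then_ones]

p = enumeration_predictor(PROGRAMS)
omega = zeros_then_ones(30)
print([n for n in range(29) if p(omega[:n]) != omega[n]])
```

Each wrong prediction rules out at least one program for the rest of the sequence, so the number of errors on any sequence in the class is bounded by the number of programs.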

Can we do better than this? Lemma 3.2 shows us that there exist predictors
able to predict at least some sequences vastly more complex than themselves.
This suggests that there might exist simple predictors able to predict arbitrary
sequences up to a high complexity. Formally, could there exist $p \in P_n$ where
$n \gg K(p)$? Unfortunately, these simple but powerful predictors are not possible:

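4.3 Theorem. $\forall n \in \mathbb{N} : p \in P_n \Rightarrow K(p) \overset{+}{>} n$.

Proof. For any $n \in \mathbb{N}$ let $p \in P_n$, that is, $\forall \omega \in C_n : p \in P(\omega)$. By Lemma 3.3
we know that $\exists \omega' \in C : p \notin P(\omega')$. As $p \notin P(\omega')$ it must be the case that
$\omega' \notin C_n$, that is, $K(\omega') > n$. From Lemma 3.3 we also know that $K(p) \overset{+}{>} K(\omega')$
and so the result follows. □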

Intuitively the reason for this is as follows: Lemma 3.3 guarantees that
every simple predictor fails for at least one simple sequence. Thus if we want 
a predictor that can learn to predict all sequences up to a moderate level of 
complexity, then clearly the predictor cannot be simple. Likewise, if we want a 
predictor that can predict all sequences up to a high level of complexity, then 
the predictor itself must be very complex. Thus, even though we have made the 
generous assumption of unlimited computational resources and data to learn 
from, only very complex algorithms can be truly powerful predictors. 



These results easily generalise to notions of complexity that take computation
time into consideration. As sequences are infinite, the appropriate measure
of time is the time needed to generate or predict the next symbol in the sequence.
Under any reasonable measure of time complexity, the operation of
inverting a single output from a binary valued function can be performed with
little cost. If C is any complexity measure with this property, it is trivial to
see that the proof of Lemma 3.3 still holds for C. From this, an analogue of
Theorem 4.3 for C easily follows. With similar arguments these results also generalise
in a straightforward way to complexity measures that take space or other
computational resources into account. Thus, the fact that extremely powerful
predictors must be very complex holds under any measure of complexity for
which inverting a single bit is inexpensive.

5 Complexity of prediction 

Another way of viewing these results is in terms of an alternate notion of sequence
complexity defined as the size of the smallest predictor able to learn
to predict the sequence. This allows us to express the results of the previous
sections more concisely. Formally, for any sequence $\omega$ define the complexity
measure,

$$\dot{K}(\omega) := \min_{p \in \mathbb{B}^*} \{ |p| : p \in P(\omega) \},$$

and $\dot{K}(\omega) := \infty$ if $P(\omega) = \emptyset$. Thus, if $\dot{K}(\omega)$ is high then the sequence $\omega$ is
complex in the sense that only complex prediction algorithms are able to learn
to predict it. It can easily be seen that this notion of complexity has the same
invariance to the choice of reference universal Turing machine as the standard
Kolmogorov complexity measure.

It may be tempting to conjecture that this definition simply describes what
might be called the "tail end complexity" of a sequence, that is, $\dot{K}(\omega) =
\lim_{i \to \infty} K(\omega_{i:\infty})$. This is not the case. Consider again Lemma 3.2 and its
proof. For any $n \in \mathbb{N}$, we let $y_{1:n}$ be a random string, that is, $K(y_{1:n}) = n$.
From this we defined a computable sequence that was a repetition of this string,
$\omega := (y_{1:n})^*$. It was then proven that there exists a single predictor $p$ which
can predict any sequence of this form, with no restriction on how high $K(\omega)$
can be. From our definition of $\dot{K}$ above it is thus clear that $\dot{K}(\omega) \overset{+}{=} 0$ for any
such $\omega$. Consider now the tail complexity of $\omega$. As $K(y_{1:n}) = n$, whenever
$i \bmod n = 0$ we have $K(\omega_{i:\infty}) \overset{+}{>} n - O(\log n)$ (the $O(\log n)$ term comes from
potentially saving bits due to not having to encode $|y_{1:n}|$). Thus even if the
limit $\lim_{i \to \infty} K(\omega_{i:\infty})$ exists (it may oscillate), it cannot be equal to $\dot{K}(\omega)$ in
general.

Using $\dot{K}$ we can now rewrite a number of our previous results more succinctly
in terms of the new complexity measure. From Lemma 3.1 it immediately follows
that,

$$\forall \omega : 0 \le \dot{K}(\omega) \overset{+}{<} K(\omega).$$

From Lemma 3.2 we know that $\exists c \in \mathbb{N}, \forall n \in \mathbb{N}, \exists \omega \in C$ such that $\dot{K}(\omega) < c$
and $K(\omega) > n$, that is, $\dot{K}$ can attain the lower bound above within a small
constant, no matter how large the value of $K$ is. The sequences for which the
upper bound on $\dot{K}$ is tight are interesting as they are the ones which demand
complex predictors. We prove the existence of these sequences and look at some
of their properties in the next section.

The complexity measure $\dot{K}$ can also be generalised to sets of sequences: for
$S \subset \mathbb{B}^\infty$ define $\dot{K}(S) := \min_p \{ |p| : p \in P(S) \}$. This allows us to rewrite
Lemma 4.2 and Theorem 4.3 as simply,

$$\forall n \in \mathbb{N} : n \overset{+}{<} \dot{K}(C_n) \overset{+}{<} n + O(\log_2 n).$$

This is just a restatement of the fact that the simplest predictor capable of
predicting all sequences up to a Kolmogorov complexity of $n$ has itself a Kolmogorov
complexity of roughly $n$.

6 Hard to predict sequences 

We have already seen that some individual sequences, such as the repeating
string used in the proof of Lemma 3.2, can have arbitrarily high Kolmogorov
complexity but nevertheless can be predicted by trivial algorithms. Thus, although
these sequences contain a lot of information in the Kolmogorov sense,
in a deeper sense their structure is very simple and easily learnt.

What interests us in this section is the other extreme: individual sequences
which can only be predicted by complex predictors. As we are only concerned 
with prediction in the limit, this extra complexity in the predictor must be some 
kind of special information which cannot be learnt just through observing the 
sequence. Our first task is to show that these difficult to predict sequences exist. 

6.1 Theorem. $\forall n \in \mathbb{N}, \exists \omega \in C : n \overset{+}{<} \dot{K}(\omega) \overset{+}{<} K(\omega) \overset{+}{<} n + O(\log_2 n)$.

Proof. For any $n \in \mathbb{N}$, let $Q_n \subset \mathbb{B}^{<n}$ be the set of programs shorter than
$n$ that are predictors, and let $x_{1:k} \in \mathbb{B}^k$ be the observed initial string from the
sequence $\omega$ which is to be predicted. Now construct a meta-predictor $\hat{p}$:

By dovetailing the computations, run in parallel every program of length
less than $n$ on every string in $\mathbb{B}^{\le k}$. Each time a program is found to halt on
all of these input strings, add the program to a set of "candidate prediction
algorithms", called $\tilde{Q}_n^k$. As each element of $Q_n$ is a valid predictor and thus will
halt for all input strings for any $k$, for every $n$ and $k$ it eventually will be the case
that $|\tilde{Q}_n^k| = |Q_n|$. At this point the simulation to approximate $Q_n$ terminates.
It is clear that for sufficiently large values of $k$ all of the valid predictors, and
only the valid predictors, will halt with a single symbol of output on all tested
input strings. That is, $\exists r \in \mathbb{N}, \forall k > r : \tilde{Q}_n^k = Q_n$.

The second part of the $\hat{p}$ algorithm uses these candidate prediction algorithms
to make a prediction. For $p \in \tilde{Q}_n^k$ define $d^k(p) := \sum_{i=1}^{k-1} |p(x_{1:i}) - x_{i+1}|$.
Informally, $d^k(p)$ is the number of prediction errors made by $p$ so far. Compute
this for all $p \in \tilde{Q}_n^k$ and then let $p_k^* \in \tilde{Q}_n^k$ be the program with minimal $d^k(p)$.
If there is more than one such program, break the tie by letting $p_k^*$ be the lexicographically
first of these. Finally, $\hat{p}$ computes the value of $p_k^*(x_{1:k})$ and then
returns this as its prediction and halts.

By Lemma 3.3 there exists $\omega' \in C$ such that $\hat{p}$ makes a prediction error
for every $k$ when trying to predict $\omega'$. Thus, in each cycle at least one of
the finitely many predictors with minimal $d^k$ makes a prediction error and so
$\forall p \in Q_n : d^k(p) \to \infty$ as $k \to \infty$. Therefore, $\nexists p \in Q_n : p \in P(\omega')$, that is,
no program of length less than $n$ can learn to predict $\omega'$ and so $n \le \dot{K}(\omega')$.
Further, from Lemma 3.1 we know that $\dot{K}(\omega') \overset{+}{<} K(\omega')$, and from Lemma 3.3
again, $K(\omega') \overset{+}{<} K(\hat{p})$.

Examining the algorithm for $\hat{p}$, we see that it contains some fixed length
program code and an encoding of $|Q_n|$, where $|Q_n| < 2^n - 1$. Thus, using a
standard encoding method for integers, $K(\hat{p}) \overset{+}{<} n + O(\log_2 n)$.

Chaining these together we get, $n \overset{+}{<} \dot{K}(\omega') \overset{+}{<} K(\omega') \overset{+}{<} K(\hat{p}) \overset{+}{<} n + O(\log_2 n)$,
which proves the theorem. □
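The second part of the meta-predictor, together with the diagonal argument applied to it, can be illustrated on a small hand-picked set of candidates. This sketch skips the dovetailed search for valid predictors entirely and simply assumes the candidate set is given as a list of total Python functions; list order stands in for lexicographic order, and all names are hypothetical.

```python
def error_count(p, prefix: str) -> int:
    """d^k(p): errors made by p on x_2, ..., x_k, where k = len(prefix)."""
    return sum(p(prefix[:i]) != prefix[i] for i in range(1, len(prefix)))


def meta_predictor(candidates):
    """Predict with the candidate that has made the fewest errors so far."""
    def p_hat(prefix: str) -> str:
        # min() returns the first minimal element, so ties go to the earliest candidate.
        best = min(candidates, key=lambda p: error_count(p, prefix))
        return best(prefix)
    return p_hat


def adversarial_prefix(p, length: int) -> str:
    """Sequence on which p is wrong at every step (the construction of Lemma 3.3)."""
    x = ""
    for _ in range(length):
        x += "1" if p(x) == "0" else "0"
    return x


def always_zero(x): return "0"
def always_one(x): return "1"
def copy_last(x): return x[-1] if x else "0"
def flip_last(x): return ("1" if x[-1] == "0" else "0") if x else "0"

CANDIDATES = [always_zero, always_one, copy_last, flip_last]
p_hat = meta_predictor(CANDIDATES)
omega = adversarial_prefix(p_hat, 60)
print([error_count(p, omega) for p in CANDIDATES])
```

Because the current leader errs in every cycle, the smallest error count must rise at least once every `len(CANDIDATES)` cycles, so every candidate's error count grows without bound along the sequence.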

This establishes the existence of sequences with arbitrarily high $\dot{K}$ complexity
which also have a similar level of Kolmogorov complexity. Next we establish
a fundamental property of high $\dot{K}$ complexity sequences: they are extremely
difficult to compute.

For an algorithm $q$ that generates $\omega \in C$, define $t_q(n)$ to be the number of
computation steps performed by $q$ before the $n$th symbol of $\omega$ is written to the
output tape. For example, if $q$ is a simple algorithm that outputs the sequence
$010101\ldots$, then clearly $t_q(n) = O(n)$ and so $\omega$ can be computed quickly. The
following lemma proves that if a sequence can be computed in a reasonable
amount of time, then the sequence must have a low $\dot{K}$ complexity:

6.2 Lemma. $\forall \omega \in C$, if $\exists q : U(q) = \omega$ and $\exists r \in \mathbb{N}, \forall n > r : t_q(n) < 2^n$, then
$\dot{K}(\omega) \overset{+}{=} 0$.

Proof. Construct a prediction algorithm $p$ as follows:

On input $x_{1:n}$, run all programs of length $n$ or less, each for $2^{n+1}$ steps. In
a set $W_n$ collect together all generated strings which are at least $n+1$ symbols
long and where the first $n$ symbols match the observed string $x_{1:n}$. Now order
the strings in $W_n$ according to a lexicographical ordering of their generating
programs. If $W_n = \emptyset$, then just return a prediction of 1 and halt. If $|W_n| \ge 1$
then return the $(n+1)$th symbol from the first sequence in the above ordering.

Assume that $\exists q : U(q) = \omega$ such that $\exists r \in \mathbb{N}, \forall n > r : t_q(n) < 2^n$. If $q$
is not unique, take $q$ to be the lexicographically first of these. Clearly $\forall n > r$
the initial string from $\omega$ generated by $q$ will be in the set $W_n$. As there is no
lexicographically lower program which can generate $\omega$ within the time constraint
$t_q(n) < 2^n$ for all $n > r$, for sufficiently large $n$ the predictor $p$ must converge on
using $q$ for each prediction and thus $p \in P(\omega)$. As $|p|$ is clearly a fixed constant
that is independent of $\omega$, it follows then that $\dot{K}(\omega) \le |p| \overset{+}{=} 0$. □
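The time-bounded search in this proof can be sketched with a toy model of computation. Here, as an assumption made only for illustration, a "program" is a Python generator function in which each `yield` is one computation step, yielding `"0"` or `"1"` when a symbol is written and `None` on a silent step. The restriction to programs of length $n$ or less is omitted, and list order stands in for lexicographic order; all names are hypothetical.

```python
from itertools import islice


def run_for(program, steps: int) -> str:
    """Output written by `program` during its first `steps` computation steps."""
    return "".join(s for s in islice(program(), steps) if s is not None)


def time_bounded_predictor(programs):
    def p(prefix: str) -> str:
        n = len(prefix)
        for prog in programs:
            out = run_for(prog, 2 ** (n + 1))      # step budget of 2^(n+1)
            if len(out) >= n + 1 and out[:n] == prefix:
                return out[n]                      # first consistent string in W_n
        return "1"                                 # W_n is empty
    return p


def zeros():
    """Writes 000..., one symbol per step."""
    while True:
        yield "0"


def slow_alternating():
    """Writes 0101..., with i silent steps before the symbol at index i."""
    i = 0
    while True:
        for _ in range(i):
            yield None
        yield "01"[i % 2]
        i += 1


p = time_bounded_predictor([zeros, slow_alternating])
omega = ("01" * 7)[:14]
print([n for n in range(13) if p(omega[:n]) != omega[n]])
```

The predictor itself has a fixed description regardless of which quickly computable sequence it is shown, which is the content of the lemma; its running time, on the other hand, grows exponentially with the length of the observed string.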

We could replace the $2^n$ bound in the above result with an even more rapidly
growing computable function, for example, $2^{2^n}$. In any case, this does not
change the fundamental result that sequences which have a high $\dot{K}$ complexity
are practically impossible to compute. However, from our theoretical perspective
these sequences present no problem as they can be predicted, albeit with
immense difficulty.

7 The limits of mathematical analysis 

One way to interpret the results of the previous sections is in terms of constructive
theories of prediction. Essentially, a constructive theory of prediction $\mathcal{T}$,
expressed in some sufficiently rich formal system $\mathcal{F}$, is in effect a description of
a prediction algorithm with respect to a universal Turing machine which implements
the required parts of $\mathcal{F}$. Thus from Theorems 4.3 and 6.1 it follows that if
we want to have a predictor that can learn to predict all sequences up to a high
level of Kolmogorov complexity, or even just predict individual sequences which
have high $\dot{K}$ complexity, the constructive theory of prediction that we base our
predictor on must be very complex. Elegant and highly general constructive
theories of prediction simply do not exist, even if we assume unlimited computational
resources. This is in marked contrast to Solomonoff's highly elegant
but non-constructive theory of prediction.

Naturally, highly complex theories of prediction will be very difficult to
mathematically analyse, if not practically impossible. Thus at some point the
development of very general prediction algorithms must become mainly an experimental
endeavour due to the difficulty of working with the required theory.
Interestingly, an even stronger result can be proven showing that beyond some
point the mathematical analysis is in fact impossible, even in theory:

7.1 Theorem. In any consistent formal axiomatic system $\mathcal{F}$ that is sufficiently
rich to express statements of the form "$p \in P_n$", there exists $m \in \mathbb{N}$ such that
for all $n > m$ and for all predictors $p \in P_n$ the true statement "$p \in P_n$" cannot
be proven in $\mathcal{F}$.

In other words, even though we have proven that very powerful sequence
prediction algorithms exist, beyond a certain complexity it is impossible to find
any of these algorithms using mathematics. The proof has a similar structure
to Chaitin's information theoretic proof [3] of Gödel's incompleteness theorem for
formal axiomatic systems [6].

Proof. For each $n \in \mathbb{N}$ let $T_n$ be the set of statements expressed in the formal
system $\mathcal{F}$ of the form "$p \in P_n$", where $p$ is filled in with the complete description
of some algorithm in each case. As the set of programs is denumerable, $T_n$ is
also denumerable and each element of $T_n$ has finite length. From Lemma 4.2
and Theorem 4.3 it follows that each $T_n$ contains infinitely many statements of
the form "$p \in P_n$" which are true.

Fix $n$ and create a search algorithm $s$ that enumerates all proofs in the
formal system $\mathcal{F}$ searching for a proof of a statement in the set $T_n$. As the set
$T_n$ is recursive, $s$ can always recognise a proof of a statement in $T_n$. If $s$ finds
any such proof, it outputs the corresponding program $p$ and then halts.

By way of contradiction, assume that $s$ halts, that is, a proof of a theorem
in $T_n$ is found and $p$ such that $p \in P_n$ is generated as output. The size of
the algorithm $s$ is a constant (a description of the formal system $\mathcal{F}$ and some
proof enumeration code) plus an $O(\log_2 n)$ term needed to describe $n$. It
follows then that $K(p) \overset{+}{<} O(\log_2 n)$. However from Theorem 4.3 we know that
$K(p) \overset{+}{>} n$. Thus, for sufficiently large $n$, we have a contradiction and so our
assumption of the existence of a proof must be false. That is, for sufficiently
large $n$ and for all $p \in P_n$, the true statement "$p \in P_n$" cannot be proven within
the formal system $\mathcal{F}$. □

The exact value of $m$ depends on our choice of formal system $\mathcal{F}$ and which
reference machine $U$ we measure complexity with respect to. However for reasonable
choices of $\mathcal{F}$ and $U$ the value of $m$ would be in the order of 1000. That
is, the bound $m$ is certainly not so large as to be vacuous.



8 Discussion 

Solomonoff induction is an elegant and extremely general model of inductive 
learning. It neatly brings together the philosophical principles of Occam's razor, 
Epicurus' principle of multiple explanations, Bayes theorem and Turing's model 
of universal computation into a theoretical sequence predictor with astonishingly 
powerful properties. If theoretical models of prediction can have such elegance 
and power, one cannot help but wonder whether similarly beautiful and highly 
general computable theories of prediction are also possible. 

What we have shown here is that there does not exist an elegant constructive
theory of prediction for computable sequences, even if we assume unbounded
computational resources, unbounded data and learning time, and place moderate
bounds on the Kolmogorov complexity of the sequences to be predicted.
Very powerful computable predictors are therefore necessarily complex. We
have further shown that the source of this problem is computable sequences
which are extremely expensive to compute. While we have proven that very
powerful prediction algorithms which can learn to predict these sequences exist,
we have also proven that, unfortunately, mathematical analysis cannot be used
to discover these algorithms due to problems of Gödel incompleteness.

These results can be extended to more general settings, specifically to those
problems which are equivalent to, or depend on, sequence prediction. Consider,
for example, a reinforcement learning agent interacting with an environment
[15]. In each interaction cycle the agent must choose its actions so as to maximise
the future rewards that it receives from the environment. Of course the
agent cannot know for certain whether or not some action will lead to rewards
in the future, thus it must predict these. Clearly, at the heart of reinforcement
learning lies a prediction problem, and so the results for computable predictors
presented in this paper also apply to computable reinforcement learners.
More specifically, from Theorem 4.3 it follows that very powerful computable
reinforcement learners are necessarily complex, and from Theorem 7.1 it follows
that it is impossible to discover extremely powerful reinforcement learning
algorithms mathematically.

It is reasonable to ask whether the assumptions we have made in our model 
need to be changed. If we increase the power of the predictors further, for 
example by providing them with some kind of an oracle, this would make the 
predictors even more unrealistic than they currently are. Clearly this goes 
against our goal of finding an elegant, powerful and general prediction theory 
that is more realistic in its assumptions than Solomonoff's incomputable model.
On the other hand, if we weaken our assumptions about the predictors' resources 
to make them more realistic, we are in effect taking a subset of our current class 
of predictors. As such, all the same limitations and problems will still apply, as 
well as some new ones. 

It seems then that the way forward is to further restrict the problem space. 
One possibility would be to bound the amount of computation time needed 
to generate the next symbol in the sequence. However if we do this without 
restricting the predictors' resources then the simple predictor from Lemma 6.2
easily learns to predict any such sequence and thus the problem of prediction in 
the limit has become trivial. Another possibility might be to bound the memory 
of the machine used to generate the sequence, however this makes the generator 
a finite state machine and thus bounds its computation time, again making the 
problem trivial. 

Perhaps the only reasonable solution would be to add additional restrictions 
to both the algorithms which generate the sequences to be predicted, and to the 
predictors. We may also want to consider not just learnability in the limit, but 
also how quickly the predictor is able to learn. Of course we are then facing a 
much more difficult analysis problem. 